The aim of these exercises is to improve your ability to deal with multi-predictor linear models where the predictors are all continuous or a mixture of continuous and categorical.
Let’s continue with the example from Chapter 7 assessing whether dietary supplements improve the perceived health of dogs with osteoarthritis. The model we focused on at the end of that exercise was one modelling the pain index of dogs after 60 days as a function of whether they received dietary supplements or a placebo, and the sex of the dog. The dogs unavoidably varied in body weight, ranging from 14 to 47 kg. To partly account for this, the authors adjusted the doses to a constant amount per kg of body weight. However, you can probably think of a range of ways in which weight might affect osteoarthritis. The model using treatment and sex fitted the data fairly well, with R² around 0.6. We detected a strong treatment effect, but it is possible that if we reduced background noise, we might see sex-specific responses, and we’d also get a more precise estimate of the effects.
Think about the steps you’d take to see if it would be helpful to include body weight in the model, then go back to the data and run the analysis.
Start by reloading the data file from the Chapter 7 exercises.
# packages used in these exercises: car (Anova, vif, scatterplotMatrix, avPlots),
# emmeans, broom (tidy, glance), lm.beta, dplyr, ggplot2, ggsci
library(car); library(emmeans); library(broom); library(lm.beta)
library(dplyr); library(ggplot2); library(ggsci)
df <- read.csv("data/martello.csv")
head(df, 10)
#Tidy up the names if original data used
#df_names<-c(GROUP="group", sterilizzato = "ster", BCS = "bcs", PESO.KG = "wt", ETA = "eta", RAZZA = "breed",
# HCPI = "hcp0", HCPI.4 = "hcp40", HCP.6 = "hcp60",
# SEGNI.OA.VET = "vet0", SEGNI.AO.VET.4 = "vet40", SEGNI.AO.VET.6 = "vet60")
#df<-rename(df, all_of(df_names))
# make sex, treatment, and sterilization factors
df$sex<-as.factor(df$sex)
df$group<-as.factor(df$group)
df$ster<-as.factor(df$ster)
#set contrasts to sum (this list takes effect only when passed to lm()
#via its contrasts= argument)
contrasts <- list(group = contr.sum, sex = contr.sum, ster = contr.sum)
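Because defining the list on its own doesn’t change R’s default treatment coding, here is a minimal sketch (simulated data, not the study’s file) showing how the choice of contrasts appears in the coefficient names:

```r
# Minimal sketch on simulated data: the contrasts list changes the coding
# only when passed to lm() explicitly via its contrasts= argument.
set.seed(1)
toy <- data.frame(
  y     = rnorm(40),
  group = factor(rep(c("CTR", "TRT"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 20))
)
ctr <- list(group = contr.sum, sex = contr.sum)

fit_default <- lm(y ~ group * sex, data = toy)                  # treatment coding
fit_sum     <- lm(y ~ group * sex, data = toy, contrasts = ctr) # sum-to-zero coding

names(coef(fit_default))  # "(Intercept)" "groupTRT" "sexM" "groupTRT:sexM"
names(coef(fit_sum))      # "(Intercept)" "group1"   "sex1" "group1:sex1"
```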
The model we had in that exercise linked final owner-assessed health (hcp60) to treatment (group), sex, and possibly sterilization status (ster).
For simplicity, let’s use the treatment/sex model, and start by running it for comparison
martello.lm <- lm(hcp60~group*sex, data=df)
summary(martello.lm)
Call:
lm(formula = hcp60 ~ group * sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-10.6364 -2.6667 -0.4545 3.1667 9.3636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.636 1.439 21.288 < 2e-16 ***
groupTRT -9.970 2.145 -4.647 4.39e-05 ***
sexM -3.747 2.145 -1.747 0.0892 .
groupTRT:sexM 1.535 3.034 0.506 0.6159
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.773 on 36 degrees of freedom
Multiple R-squared: 0.5485, Adjusted R-squared: 0.5108
F-statistic: 14.58 on 3 and 36 DF, p-value: 2.249e-06
emmeans(martello.lm, ~ group:sex)
group sex emmean SE df lower.CL upper.CL
CTR F 30.6 1.44 36 27.7 33.6
TRT F 20.7 1.59 36 17.4 23.9
CTR M 26.9 1.59 36 23.7 30.1
TRT M 18.5 1.44 36 15.5 21.4
Confidence level used: 0.95
Start by checking covariate ranges
boxplot(wt~group*sex, data=df)
Some slight differences, including sex-based ones, but ranges overlap, so probably OK
Now run the model and look at the residuals. Try the model with the extra slope terms (wt and its interactions with group and sex) first.
martello2.lm <- lm(hcp60~group*sex*wt, data=df)
plot(martello2.lm)
summary(martello2.lm)
Call:
lm(formula = hcp60 ~ group * sex * wt, data = df)
Residuals:
Min 1Q Median 3Q Max
-9.7534 -3.0565 0.0158 3.2479 9.1178
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.6310 8.5198 4.769 3.88e-05 ***
groupTRT -27.2766 11.8725 -2.297 0.0283 *
sexM -13.6250 11.1398 -1.223 0.2302
wt -0.3885 0.3262 -1.191 0.2424
groupTRT:sexM 17.1288 14.5297 1.179 0.2471
groupTRT:wt 0.6416 0.4303 1.491 0.1457
sexM:wt 0.3851 0.3825 1.007 0.3215
groupTRT:sexM:wt -0.5837 0.4942 -1.181 0.2463
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.884 on 32 degrees of freedom
Multiple R-squared: 0.5798, Adjusted R-squared: 0.4879
F-statistic: 6.308 on 7 and 32 DF, p-value: 0.0001073
Anova(martello2.lm, type='III')
Anova Table (Type III tests)
Response: hcp60
Sum Sq Df F value Pr(>F)
(Intercept) 542.45 1 22.7433 3.884e-05 ***
group 125.89 1 5.2784 0.02828 *
sex 35.68 1 1.4960 0.23023
wt 33.83 1 1.4185 0.24240
group:sex 33.15 1 1.3898 0.24714
group:wt 53.03 1 2.2235 0.14572
sex:wt 24.19 1 1.0141 0.32149
group:sex:wt 33.27 1 1.3950 0.24627
Residuals 763.23 32
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Nothing here suggests that the slopes differ across the categorical predictors, so let’s run an equal-slopes (separate-intercepts) model.
martello3.lm <- lm(hcp60~group*sex+wt, data=df)
plot(martello3.lm)
summary(martello3.lm)
Call:
lm(formula = hcp60 ~ group * sex + wt, data = df)
Residuals:
Min 1Q Median 3Q Max
-10.6894 -2.7427 -0.3665 3.0188 9.5440
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.03581 2.94520 10.198 5.06e-12 ***
groupTRT -10.04350 2.19668 -4.572 5.80e-05 ***
sexM -3.96393 2.36158 -1.679 0.102
wt 0.02334 0.09946 0.235 0.816
groupTRT:sexM 1.74285 3.19916 0.545 0.589
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.837 on 35 degrees of freedom
Multiple R-squared: 0.5492, Adjusted R-squared: 0.4977
F-statistic: 10.66 on 4 and 35 DF, p-value: 9.35e-06
Anova(martello3.lm, type='III')
Anova Table (Type III tests)
Response: hcp60
Sum Sq Df F value Pr(>F)
(Intercept) 2433.32 1 104.0040 5.06e-12 ***
group 489.08 1 20.9043 5.80e-05 ***
sex 65.92 1 2.8174 0.1022
wt 1.29 1 0.0551 0.8158
group:sex 6.94 1 0.2968 0.5894
Residuals 818.87 35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
emmeans(martello3.lm, ~group|sex)
sex = F:
group emmean SE df lower.CL upper.CL
CTR 30.7 1.51 35 27.7 33.8
TRT 20.7 1.61 35 17.4 24.0
sex = M:
group emmean SE df lower.CL upper.CL
CTR 26.8 1.70 35 23.3 30.2
TRT 18.5 1.46 35 15.5 21.4
Confidence level used: 0.95
Body weight plays very little role here. Adding it to the model only raises the variance explained by 0.1%, and not surprisingly, the estimated effects haven’t changed much either.
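One way to make "plays very little role" concrete is an F test comparing the nested models with anova(); a sketch on simulated data (not the martello file, and the effect sizes below are made up for illustration):

```r
# Sketch on simulated data: compare the treatment/sex model with and
# without the weight covariate via a nested-model F test.
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("CTR", "TRT"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 20)),
  wt    = runif(40, 14, 47)
)
# response built from group and sex effects only; wt plays no role
toy$hcp60 <- 30 - 10 * (toy$group == "TRT") - 4 * (toy$sex == "M") +
  rnorm(40, sd = 5)

m_base <- lm(hcp60 ~ group * sex, data = toy)
m_wt   <- lm(hcp60 ~ group * sex + wt, data = toy)
anova(m_base, m_wt)  # F test on the single df added by wt
```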
LaRosa and Connor (Amer. J. Bot., 2017) examined effects of several floral traits on fitness components of milkweeds, Asclepias spp. The fitness components were male and female pollination success and female reproductive success.
In the paper, their analysis focused on six predictors. They measured six floral traits:
- gynostegium width,
- hood length,
- hood height,
- horn reach,
- slit length, and
- gap width.
One of them, horn reach, was not relevant for Asclepias viridiflora, which was the species with the largest sample size.
Their Figure 2 shows what these measurements represent on flowers.
The data are available from Dryad.
The floral traits were different between species, so data analysis would require each species to be analyzed separately or for the measurements to be standardized within each species.
df <- read.csv("data/larosa.csv")
head(df,10)
df_syr<-subset(df,species=="Asyr")
df_vir<-subset(df, species=='Avir')
Fitness component estimates were relativized by dividing by the mean, and the traits were standardized to a mean of zero and standard deviation of one.
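Those two steps can be sketched like this on toy data (using the paper’s column names; the real analysis uses the Dryad file):

```r
# Toy sketch: relativize a fitness component by its within-species mean,
# and z-standardize a trait within each species, using ave().
toy <- data.frame(
  species   = rep(c("Asyr", "Avir"), each = 3),
  fruits    = c(2, 4, 6, 1, 3, 5),
  gyn.width = c(1.0, 1.2, 1.4, 0.5, 0.7, 0.9)
)
toy$rel.fruits  <- ave(toy$fruits, toy$species,
                       FUN = function(x) x / mean(x, na.rm = TRUE))
toy$z.gyn.width <- ave(toy$gyn.width, toy$species,
                       FUN = function(x) (x - mean(x)) / sd(x))
# each species now has mean(rel.fruits) = 1, mean(z) = 0, sd(z) = 1
```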
First look at the removals per flower
What checks should you do before assessing the predictors’ effects?
scatterplotMatrix(data=df_syr,~ removals.per.flower + gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
cor(df_syr[,c('gyn.width', 'hood.length', 'hood.height', 'horn.reach', 'slit.length', 'gap.width')])
gyn.width hood.length hood.height horn.reach slit.length gap.width
gyn.width 1.00000000 0.2985275 0.2799940 0.2282321 0.386472597 0.042317618
hood.length 0.29852752 1.0000000 0.4176909 0.6255484 0.262450382 0.157199159
hood.height 0.27999397 0.4176909 1.0000000 0.6098929 0.339856045 0.252321055
horn.reach 0.22823215 0.6255484 0.6098929 1.0000000 0.345905997 0.253276599
slit.length 0.38647260 0.2624504 0.3398560 0.3459060 1.000000000 0.007694012
gap.width 0.04231762 0.1571992 0.2523211 0.2532766 0.007694012 1.000000000
vif(lm(data=df_syr, removals.per.flower ~ gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width))
gyn.width hood.length hood.height horn.reach slit.length gap.width
1.257452 1.715965 1.708398 2.259544 1.314818 1.100276
Diagnostics OK
syr1.lm<-lm(data=df_syr, removals.per.flower ~ gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
plot(syr1.lm)
summary(syr1.lm)
Call:
lm(formula = removals.per.flower ~ gyn.width + hood.length +
hood.height + horn.reach + slit.length + gap.width, data = df_syr)
Residuals:
Min 1Q Median 3Q Max
-2.07957 -0.49798 0.03519 0.72515 2.15210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.012 3.540 0.568 0.57304
gyn.width -18.935 15.068 -1.257 0.21657
hood.length 12.904 4.958 2.603 0.01311 *
hood.height 13.827 4.762 2.903 0.00612 **
horn.reach -10.923 7.086 -1.542 0.13147
slit.length -16.481 15.667 -1.052 0.29947
gap.width -28.025 14.191 -1.975 0.05559 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.055 on 38 degrees of freedom
Multiple R-squared: 0.3387, Adjusted R-squared: 0.2342
F-statistic: 3.243 on 6 and 38 DF, p-value: 0.01132
If you’re happy with your pre-flight checks, fit the model and draw some conclusions about the effects of each predictor, including any notes of caution.
Look at results of model fitting
options(digits=3)
tidy(syr1.lm, conf.int = TRUE)
glance(syr1.lm)
Get standardized coefficients with lm.beta
lm.beta.syr1 <- lm.beta(syr1.lm)
tidy(lm.beta.syr1, conf.int = TRUE)
Next, do insertions.per.flower
scatterplotMatrix(data=df_syr,~ insertions.per.flower + gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
syr2.lm<-lm(data=df_syr, insertions.per.flower ~ gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
plot(syr2.lm)
summary(syr2.lm)
Call:
lm(formula = insertions.per.flower ~ gyn.width + hood.length +
hood.height + horn.reach + slit.length + gap.width, data = df_syr)
Residuals:
Min 1Q Median 3Q Max
-0.417 -0.173 -0.009 0.122 0.571
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0181 0.7873 0.02 0.98
gyn.width -1.1430 3.3512 -0.34 0.73
hood.length 1.7626 1.1026 1.60 0.12
hood.height 1.7360 1.0592 1.64 0.11
horn.reach -2.5932 1.5759 -1.65 0.11
slit.length -0.4547 3.4844 -0.13 0.90
gap.width -3.0716 3.1561 -0.97 0.34
Residual standard error: 0.235 on 38 degrees of freedom
Multiple R-squared: 0.138, Adjusted R-squared: 0.00134
F-statistic: 1.01 on 6 and 38 DF, p-value: 0.433
options(digits=3)
tidy(syr2.lm, conf.int = TRUE)
glance(syr2.lm)
lm.beta.syr2 <- lm.beta(syr2.lm)
tidy(lm.beta.syr2, conf.int = TRUE)
The model explains less of the variation; the overall R² is much lower. The same two predictors have the highest standardized coefficients, but this is largely noise here.
Now do numbers of fruits
scatterplotMatrix(data=df_syr,~ fruits + gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
syr3.lm<-lm(data=df_syr, fruits ~ gyn.width + hood.length + hood.height + horn.reach + slit.length + gap.width)
plot(syr3.lm)
summary(syr3.lm)
Call:
lm(formula = fruits ~ gyn.width + hood.length + hood.height +
horn.reach + slit.length + gap.width, data = df_syr)
Residuals:
Min 1Q Median 3Q Max
-9.142 -3.143 -0.369 2.709 15.094
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.09 17.79 -1.69 0.100 .
gyn.width 102.33 75.42 1.36 0.184
hood.length -9.25 25.11 -0.37 0.715
hood.height 11.39 23.68 0.48 0.633
horn.reach 85.09 39.04 2.18 0.036 *
slit.length -1.08 77.23 -0.01 0.989
gap.width -107.72 72.86 -1.48 0.148
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.11 on 35 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.292, Adjusted R-squared: 0.17
F-statistic: 2.4 on 6 and 35 DF, p-value: 0.0475
options(digits=3)
tidy(syr3.lm, conf.int = TRUE)
glance(syr3.lm)
lm.beta.syr3 <- lm.beta(syr3.lm)
tidy(lm.beta.syr3, conf.int = TRUE)
Still lots of unexplained variation. Only one predictor has much influence (horn reach)
What would you need to check in doing analyses on three different fitness components as response variables?
Check that they aren’t correlated with each other.
cor(df_syr[,c('removals.per.flower', 'insertions.per.flower', 'fruits')])
removals.per.flower insertions.per.flower fruits
removals.per.flower 1.000 0.499 NA
insertions.per.flower 0.499 1.000 NA
fruits NA NA 1
OK; the only notable correlation is r = 0.5 (removals vs insertions). The NAs arise because some fruit counts are missing.
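The NAs come from missing fruit counts; a small sketch (toy data) of how cor()’s use argument handles them:

```r
# Toy sketch: cor() returns NA for any variable containing missing values
# unless told how to handle them.
toy <- data.frame(a      = c(1, 2, 3, 4),
                  b      = c(2, 1, 4, 3),
                  fruits = c(5, NA, 7, 9))
cor(toy)                                 # fruits row and column are NA
cor(toy, use = "pairwise.complete.obs")  # uses all available pairs
```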
What do you conclude about the floral traits’ influence on fitness components of this species?
Floral traits affect two of the three components - hood parameters are positively related to pollen removal rates, and horn reach is linked to higher fruit numbers.
Remember that one floral trait, horn reach, isn’t relevant for flowers of this species, so it is dropped from the models below.
We’ll just show the full code block here.
#Run diagnostics
scatterplotMatrix(data=df_vir,~ removals.per.flower + gyn.width + hood.length + hood.height + slit.length + gap.width)
cor(df_vir[,c('gyn.width', 'hood.length', 'hood.height', 'slit.length', 'gap.width')])
gyn.width hood.length hood.height slit.length gap.width
gyn.width 1.000 0.1286 0.1672 0.24030 0.25034
hood.length 0.129 1.0000 0.1527 0.08002 -0.04063
hood.height 0.167 0.1527 1.0000 0.22310 -0.04564
slit.length 0.240 0.0800 0.2231 1.00000 0.00707
gap.width 0.250 -0.0406 -0.0456 0.00707 1.00000
vif(lm(data=df_vir, removals.per.flower ~ gyn.width + hood.length + hood.height + slit.length + gap.width))
gyn.width hood.length hood.height slit.length gap.width
1.17 1.04 1.09 1.10 1.08
vir1.lm<-lm(data=df_vir, removals.per.flower ~ gyn.width + hood.length + hood.height + slit.length + gap.width)
plot(vir1.lm)
summary(vir1.lm)
Call:
lm(formula = removals.per.flower ~ gyn.width + hood.length +
hood.height + slit.length + gap.width, data = df_vir)
Residuals:
Min 1Q Median 3Q Max
-2.034 -0.463 0.041 0.526 1.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.510 1.427 1.06 0.29
gyn.width -0.773 4.754 -0.16 0.87
hood.length 3.647 5.684 0.64 0.52
hood.height 3.613 1.750 2.06 0.04 *
slit.length -2.939 6.213 -0.47 0.64
gap.width -6.765 6.462 -1.05 0.30
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.747 on 206 degrees of freedom
Multiple R-squared: 0.0315, Adjusted R-squared: 0.00803
F-statistic: 1.34 on 5 and 206 DF, p-value: 0.248
#Look at results of model fitting
options(digits=3)
tidy(vir1.lm, conf.int = TRUE)
glance(vir1.lm)
#Get standardized coefficients with lmbeta
lm.beta.vir1 <- lm.beta(vir1.lm)
tidy(lm.beta.vir1, conf.int = TRUE)
#Run through the same sequence for the other two fitness components
scatterplotMatrix(data=df_vir,~ insertions.per.flower + gyn.width + hood.length + hood.height + slit.length + gap.width)
vir2.lm<-lm(data=df_vir, insertions.per.flower ~ gyn.width + hood.length + hood.height + slit.length + gap.width)
plot(vir2.lm)
summary(vir2.lm)
Call:
lm(formula = insertions.per.flower ~ gyn.width + hood.length +
hood.height + slit.length + gap.width, data = df_vir)
Residuals:
Min 1Q Median 3Q Max
-0.2910 -0.1235 -0.0343 0.1059 0.4594
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0259 0.3269 0.08 0.937
gyn.width -0.4290 1.0889 -0.39 0.694
hood.length -0.0425 1.3019 -0.03 0.974
hood.height 0.2837 0.4007 0.71 0.480
slit.length 2.8794 1.4231 2.02 0.044 *
gap.width -2.3061 1.4802 -1.56 0.121
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.171 on 206 degrees of freedom
Multiple R-squared: 0.0393, Adjusted R-squared: 0.016
F-statistic: 1.68 on 5 and 206 DF, p-value: 0.14
options(digits=3)
tidy(vir2.lm, conf.int = TRUE)
glance(vir2.lm)
lm.beta.vir2 <- lm.beta(vir2.lm)
tidy(lm.beta.vir2, conf.int = TRUE)
scatterplotMatrix(data=df_vir,~ fruits + gyn.width + hood.length + hood.height + slit.length + gap.width)
vir3.lm<-lm(data=df_vir, fruits ~ gyn.width + hood.length + hood.height + slit.length + gap.width)
plot(vir3.lm)
summary(vir3.lm)
Call:
lm(formula = fruits ~ gyn.width + hood.length + hood.height +
slit.length + gap.width, data = df_vir)
Residuals:
Min 1Q Median 3Q Max
-3.864 -1.570 -0.405 0.974 14.857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.46 4.73 -0.73 0.46489
gyn.width 15.72 15.90 0.99 0.32412
hood.length 64.00 18.85 3.40 0.00083 ***
hood.height -2.04 6.02 -0.34 0.73444
slit.length 10.27 20.78 0.49 0.62171
gap.width -7.30 21.49 -0.34 0.73432
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.46 on 199 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.0681, Adjusted R-squared: 0.0447
F-statistic: 2.91 on 5 and 199 DF, p-value: 0.0147
options(digits=3)
tidy(vir3.lm, conf.int = TRUE)
glance(vir3.lm)
lm.beta.vir3 <- lm.beta(vir3.lm)
tidy(lm.beta.vir3, conf.int = TRUE)
A hood height effect is detected for removals per flower, but the model’s adjusted R² is below 1% and the standardized estimate is not particularly large. Slit length is detected for insertions per flower, but the adjusted R² is again very low (1.6%) and the standardized estimate is small. The fruits model also fits poorly (adjusted R² < 5%); a highly “significant” hood length effect is detected, but its standardized estimate (0.24) is low, though the strongest for this species.
What do you conclude about the role of floral traits in these two species?
Different floral traits matter - the best predictor of each fitness component is different between the two species.
Is there anything you’d be cautious about in making this comparison?
Different detection abilities for the two species: one is commoner than the other, so it was easier to get a large sample.
Watch out for students using statistical significance alone to guide conclusions.
The different sensitivities probably mean that two of the three effects for A. viridiflora wouldn’t have been detected with the A. syriaca sample size.
Standardized coefficients suggest weaker overall effects in A. viridiflora
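The sample-size point can be roughly quantified with Cohen’s f² and the noncentral F distribution (base R; the f² = 0.1 effect size below is illustrative, not estimated from these data):

```r
# Approximate power of the overall F test for a regression with u
# predictors and n observations, at Cohen's effect size f2 (alpha = 0.05).
# Cohen's noncentrality parameter is lambda = f2 * (u + v + 1).
power_f2 <- function(f2, u, n) {
  v    <- n - u - 1                  # denominator df
  crit <- qf(0.95, u, v)             # critical F at alpha = 0.05
  1 - pf(crit, u, v, ncp = f2 * (u + v + 1))
}
power_f2(f2 = 0.1, u = 6, n = 45)    # roughly the A. syriaca sample
power_f2(f2 = 0.1, u = 5, n = 212)   # roughly the A. viridiflora sample
```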
Recall the sengi example from Chapter 5 (or go back and look at it ;-)). The research questions there were about family differences. We could also look at the relationship between sengis and the rest. There are two or three groups of insectivores: sengis, plus closely related (afrotherian) and more distantly related (laurasiatherian) species, and the research question is about where sengis fit. We can frame this as two or three questions.
Does the new species (udzugwensis) fit within the pattern of other sengi?
Are sengi different from the other small insectivores in their brain size?
sengi vs all others, or
sengi vs closely-related vs distantly related insectivores
Get started by loading the kaufman data. In Chapter 5, you should have decided that log-transforming both variables was sensible, so let’s also start by defining new variables logbrain and logbody. That will make the coding tidier, without having to log things repeatedly.
df <- read.csv("data/kaufman.csv")
df$logbrain <- log(df$brainmass)
df$logbody <- log(df$bodymass)
For the first question, cast your mind back to Chapter 6. How would you decide whether the new species is unusual?
Use the existing species to calculate a linear regression, then calculate the predicted value for a species of the body mass of R. udzugwensis.
sengi <- filter(df, family == "Macroscelididae")
sengi
sengi.lm <- lm(data=filter(sengi, species != "udzugwensis"), logbrain ~ logbody)
glance(sengi.lm)
tidy(sengi.lm)
# Now predict values for *R. udzugwensis*
predict(sengi.lm, data.frame(logbody = 6.565), interval = "prediction", se.fit = TRUE)
$fit
fit lwr upr
1 8.91 8.5 9.32
$se.fit
[1] 0.0605
$df
[1] 2
$residual.scale
[1] 0.0723
Predicted log brain mass is 8.91, and the prediction interval is 8.50 to 9.32, so this species is pretty much smack on the line.
You could make this comparison in two ways: sengi vs. all the other insectivores, or sengi vs. closely related vs. distantly related insectivores.
Before you start, are there any things to check in the original data, linked to the assumptions of the linear model you’ll fit?
The other important thing to note is that if we’re looking to compare sengis and the other groups formally, we need to be careful about the ranges of body sizes. Sengi species start around the middle of the range, from 45 g up. We’d probably restrict our comparison to species around this size, excluding those with body mass below a threshold. The threshold would be arbitrary, but most people would choose 40 or 45 g.
ggplot(data=df, aes(x=log(bodymass), y=log(brainmass), colour = relation, shape = relation))+
geom_point()+ theme_light() + scale_color_uchicago()
Outline the steps you’ll take
- Fit complex model allowing slopes to vary among groups
- Assess whether the groups*logbody term should remain in the model
- If relationships can be treated as parallel, run simpler model with groups and logbody
- Examine results and decide whether to do any comparisons among groups
df$relation<-factor(df$relation)
df$relation2<-factor(df$relation2)
# as before, this list must be passed to the model fit via contrasts= to take effect
contrasts <- list(relation = contr.sum)
# Create subset of file. We could keep running the filter each time we run the model, but doing it once is tidier.
df2 <- filter(df, bodymass > 40)
lm2 <- aov(logbrain ~ logbody * relation, data = df2)  # * expands to both main effects plus the interaction
glance(lm2)
tidy(lm2)
Anova(lm2, type='III')
Anova Table (Type III tests)
Response: logbrain
Sum Sq Df F value Pr(>F)
(Intercept) 9.49 1 126.81 7.8e-11 ***
logbody 3.41 1 45.58 6.9e-07 ***
relation 0.05 2 0.30 0.74
logbody:relation 0.07 2 0.45 0.64
Residuals 1.72 23
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The logbody:relation term contributes little to the model fit, so run the simpler model.
lm3<-aov(logbrain~logbody + relation, data = df2)
glance(lm3)
tidy(lm3)
Anova(lm3, type='III')
Anova Table (Type III tests)
Response: logbrain
Sum Sq Df F value Pr(>F)
(Intercept) 23.29 1 326 7.6e-16 ***
logbody 12.99 1 182 5.8e-13 ***
relation 1.86 2 13 0.00013 ***
Residuals 1.79 25
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Get adjusted means
library(effects)
Use the command
lattice::trellis.par.set(effectsTheme())
to customize lattice options for effects plots.
See ?effectsTheme for details.
adjmeans <- effect("relation", lm3, se=TRUE)
summary(adjmeans)
relation effect
relation
afrotherian laurasiatherian sengi
7.17 7.34 7.91
Lower 95 Percent Confidence Limits
relation
afrotherian laurasiatherian sengi
7.01 7.18 7.66
Upper 95 Percent Confidence Limits
relation
afrotherian laurasiatherian sengi
7.34 7.50 8.16
ggplot(data=df2, aes(x=log(bodymass), y=log(brainmass), group= relation, colour = relation, shape = relation))+
geom_point()+ geom_smooth(method=lm) + theme_light() + scale_color_uchicago()
No need to proceed further. The slopes can be treated as common across groups, and the adjusted means show that sengi sit clearly above the other two groups. Those two groups are close, with abutting confidence intervals around the adjusted means.
lm4 <- lm(log(brainmass) ~ log(bodymass), data=df2)
df3 <- cbind(df2, lm4$residuals)
head(df3,10)
boxplot(lm4$residuals ~ relation, data = df3)
lm5 <- aov (lm4$residuals ~ relation, data = df3)
summary(lm5)
Df Sum Sq Mean Sq F value Pr(>F)
relation 2 1.80 0.901 12.6 0.00015 ***
Residuals 26 1.85 0.071
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Pattern also clear here, just from box plots
We’ll return to the elephant seal example in the study by Le Boeuf et al., and see whether body weight plays any role in foraging. In Chapter 5, you should have noticed that while the focus of the initial analysis was on the relationship between distance travelled and time spent on the foraging grounds, the authors also recorded weight on departure for each animal. Your exploratory data analysis should have shown a relationship between body weight and the original predictor and response variables. Now try to make some sense of what’s going on here.
#Get the data file back
df <- read.csv("data/leboeuf.csv")
head(df,10)
Think about how body mass might influence distance travelled and how it might contribute to time on foraging areas
Mass could easily be linked to swimming speed, allowing larger animals to forage for longer. We could even get more complicated and speculate whether larger males may spend more time maintaining dominance on the shore, so might forage less. In either case, the two variables could be linked.
Body mass might also reflect overall condition, and, for example, males in poorer condition might need to spend longer feeding.
How will you assess whether including body weight as a second predictor helps us understand feeding time better?
Time on foraging = β0 + β1 Distance travelled + β2 Body weight
Need to check collinearity - use VIF
Check residuals
Check linearity
cor(df[,c('distance','departwt')])
distance departwt
distance 1 NA
departwt NA 1
laboeuf1.lm <- lm (FFAduration~ distance + departwt, data = df)
vif(laboeuf1.lm)
distance departwt
4.34 4.34
plot(laboeuf1.lm)
The VIF values are not small, but below the common threshold of 5. (The NAs in the correlation matrix reflect missing values; adding use = "pairwise.complete.obs" to the cor() call gives the correlation itself.)
Nothing dramatic in residuals, though points 13, 14, and 16 have large residuals
augment(laboeuf1.lm)
Fit the appropriate model to the data, interpret the results, and explain whether body weight helps us.
Get the equation
\[ \operatorname{FFAduration} = \alpha + \beta_{1}(\operatorname{distance}) + \beta_{2}(\operatorname{departwt}) + \epsilon \]
glance(laboeuf1.lm)
tidy(laboeuf1.lm)
Get standardised regression coefficients
Use the lm.beta function from the lm.beta package
lm.beta.laboeuf <- lm.beta(laboeuf1.lm)
tidy(lm.beta.laboeuf, conf.int = TRUE)
glance(lm.beta.laboeuf)
Show the partial regression (added-variable) plots
avPlots(laboeuf1.lm, ask=F)
The conclusion is that there’s a strong statistical relationship. The model explains around 60% of the variation, and the effect is largely associated with distance travelled: time on the foraging grounds falls sharply with distance.
Departure weight plays little role.
Is there anything else you might look at?
The relationship of distance to both response and the other predictor might make it worth looking for a non-additive model
laboeuf2.lm <- lm(FFAduration ~ distance * departwt, data = df)  # * expands to main effects plus interaction
vif(laboeuf2.lm)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
distance departwt distance:departwt
91.6 41.6 225.4
# High vif values, as expected, so centre predictors and run again
df$cdistance <- scale(df$distance, center=TRUE, scale=FALSE)
df$cdepartwt <- scale(df$departwt, center=TRUE, scale=FALSE)
laboeuf3.lm <- lm(FFAduration ~ cdistance * cdepartwt, data = df)
vif(laboeuf3.lm)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
cdistance cdepartwt cdistance:cdepartwt
7.89 8.43 1.95
lm.beta.laboeuf3 <- lm.beta(laboeuf3.lm)
tidy(lm.beta.laboeuf3, conf.int = TRUE)
glance(lm.beta.laboeuf3)
This is actually a worse model: VIF issues remain, the adjusted R² is lower, and the interaction term contributes nothing.